We will re-use the colon cancer data in GSE33126

Importing the data

library(GEOquery)
url <- "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE33126/"
filenm <- "GSE33126_series_matrix.txt.gz"
if(!file.exists("GSE33126_series_matrix.txt.gz")) download.file(paste(url, filenm, sep=""), destfile=filenm)
colonData <- getGEO(filename=filenm)
## File stored at: 
## /tmp/Rtmpf4Tli9/GPL6947.soft
colonData
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 48803 features, 18 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM820516 GSM820517 ... GSM820533 (18 total)
##   varLabels: title geo_accession ... data_row_count (31 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1343291 ILMN_1343295 ... ILMN_2416019 (48803
##     total)
##   fvarLabels: ID nuID ... GB_ACC (30 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL6947
exprs (colonData) <- log2 (exprs(colonData))
SampleGroup <- pData(colonData)$source_name_ch1
Patient <- pData(colonData)$characteristics_ch1.1

Filtering the data

A first step in the analysis of microarray data is often to remove any uniformative probes. We can do this because typically only 50% of probes genes will be expressed, and even fewer than this will be differentially expressed. Including such non-informative genes in visualisa- tion will obscure any biological differences that exist. The genefilter package contains a suite of tools for the filtering of microarray data. The varFilter function allows probes with low- variance to be removed from the dataset. The metric using to decide which probes to remove is the Inter-Quartile Range (IQR), and by default half of the probes are removed. Both the function used to do the filtering, and cut-off can be specified by the user.

library (genefilter)
## 
## Attaching package: 'genefilter'
## 
## The following object is masked from 'package:base':
## 
##     anyNA
dim (colonData)
## Features  Samples 
##    48803       18
varFiltered <- varFilter (colonData)
dim (varFiltered)
## Features  Samples 
##    24401       18
nrow (colonData) / nrow (varFiltered)
## Features 
## 2.000041

Calculating a distance matrix

The first step towards clustering the data is to construct a distance matrix. Each entry in this matrix is the pairwise distance between two samples. Note that for expression data we have to transpose the standard ExpressionSet format, as clustering is designed to work on rows rather than columns. The standard function to make a distance matrix in R is dist which uses the euclidean metric. As the data entries in a distance matrix are symmetrical, it has a different representation to the default matrix (i.e. values are not repeated unnecessarily), and clustering algorithms are able to accept this representation as input.

euc.dist <- dist (t(exprs(varFiltered)))
euc.dist
##           GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521
## GSM820517 146.35076                                                  
## GSM820518 112.10192 145.53738                                        
## GSM820519 127.02787 111.88327 128.89085                              
## GSM820520 103.88617 145.88415  93.26812 124.41043                    
## GSM820521 115.74149 109.59666 112.44568  77.60015 110.62329          
## GSM820522  86.59678 140.12151 115.68936 131.12234  98.96944 117.26705
## GSM820523 118.07552 115.02450 139.86229 101.32561 134.43368  89.05971
## GSM820524 113.05465 135.64296 106.83091 126.25701  85.61877 117.19595
## GSM820525 142.82777  72.50357 145.43070 101.26515 138.18094 107.93645
## GSM820526  87.68842 160.67923 109.56682 126.28151 102.83077 114.59309
## GSM820527 102.51958 121.92622 119.62366  78.01042 110.89035  69.13310
## GSM820528  85.71746 132.44295  96.04791 115.36869  75.14097 103.00987
## GSM820529 107.07671 118.90261 120.08406  94.25470 114.46866  79.01849
## GSM820530  80.36987 151.46130  94.29680 123.97811  86.59315 103.45768
## GSM820531 108.26678 100.31635 115.03569  82.27479 112.72244  64.87145
## GSM820532 128.93571  95.33138 115.04522 106.11806 116.81059 102.43630
## GSM820533 106.21944 134.12719 115.78666  97.39843 115.37280  68.75089
##           GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527
## GSM820517                                                            
## GSM820518                                                            
## GSM820519                                                            
## GSM820520                                                            
## GSM820521                                                            
## GSM820522                                                            
## GSM820523 105.62731                                                  
## GSM820524  85.85654 116.10403                                        
## GSM820525 127.62848  99.13618 120.46802                              
## GSM820526 107.40340 134.03152 114.20701 154.74502                    
## GSM820527 106.32390  77.38321 115.56748 109.20371 104.05182          
## GSM820528  89.80380 112.44894  90.45972 124.46834  97.24261  90.59615
## GSM820529 116.33941  83.80927 123.55735 107.46869 119.74686  59.39630
## GSM820530  97.19378 126.82149 107.83406 146.18764  87.43164 101.35185
## GSM820531 116.26826  91.51850 122.61883 103.07141 119.87820  70.55040
## GSM820532 124.92807 122.94258 116.57744 103.39879 137.24907 113.49486
## GSM820533 116.89886  93.17867 127.75562 130.21971 109.87582  64.16850
##           GSM820528 GSM820529 GSM820530 GSM820531 GSM820532
## GSM820517                                                  
## GSM820518                                                  
## GSM820519                                                  
## GSM820520                                                  
## GSM820521                                                  
## GSM820522                                                  
## GSM820523                                                  
## GSM820524                                                  
## GSM820525                                                  
## GSM820526                                                  
## GSM820527                                                  
## GSM820528                                                  
## GSM820529  86.75871                                        
## GSM820530  81.22235 103.68352                              
## GSM820531  96.07646  70.24274  99.39146                    
## GSM820532 105.14902 109.33036 120.61473  93.09175          
## GSM820533  99.91990  68.56935  96.27636  73.49531 119.19071

For gene-expression data, it is common to use correlation as a distance metric rather than the Euclidean. You should make sure that you know the difference between the two metrics. The cor function can be used to calculate the correlation of columns in a matrix. Each row (or column) in the resulting matrix is the correlation of that sample with all other samples in the dataset. The matrix is symmetrical and we can transform this into a distance matrix by first subtracting 1 from the correlation matrix. Hence, samples with a higher correlation have a smaller ’distance’.

corMat <- cor(exprs(varFiltered))
corMat
##           GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521
## GSM820516 1.0000000 0.8582704 0.9171976 0.8937222 0.9286023 0.9112575
## GSM820517 0.8582704 1.0000000 0.8600812 0.9173538 0.8588151 0.9202012
## GSM820518 0.9171976 0.8600812 1.0000000 0.8907537 0.9425576 0.9163944
## GSM820519 0.8937222 0.9173538 0.8907537 1.0000000 0.8978211 0.9602207
## GSM820520 0.9286023 0.8588151 0.9425576 0.8978211 1.0000000 0.9187180
## GSM820521 0.9112575 0.9202012 0.9163944 0.9602207 0.9187180 1.0000000
## GSM820522 0.9503610 0.8696543 0.9115547 0.8864231 0.9349881 0.9085925
## GSM820523 0.9081569 0.9126298 0.8713382 0.9324985 0.8806670 0.9475820
## GSM820524 0.9156037 0.8781899 0.9247661 0.8949583 0.9514800 0.9089643
## GSM820525 0.8648318 0.9650722 0.8601046 0.9322218 0.8731558 0.9224893
## GSM820526 0.9493074 0.8293478 0.9209805 0.8950744 0.9301280 0.9131154
## GSM820527 0.9302064 0.9009531 0.9051381 0.9597308 0.9180982 0.9681213
## GSM820528 0.9511778 0.8829546 0.9388145 0.9117054 0.9623691 0.9291090
## GSM820529 0.9241083 0.9061505 0.9047172 0.9413361 0.9130390 0.9584997
## GSM820530 0.9573796 0.8482370 0.9414236 0.8987844 0.9504108 0.9291209
## GSM820531 0.9223823 0.9331725 0.9125297 0.9552921 0.9156400 0.9720189
## GSM820532 0.8897132 0.9395367 0.9123688 0.9254827 0.9092395 0.9300936
## GSM820533 0.9255052 0.8809038 0.9116292 0.9374967 0.9118986 0.9686817
##           GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527
## GSM820516 0.9503610 0.9081569 0.9156037 0.8648318 0.9493074 0.9302064
## GSM820517 0.8696543 0.9126298 0.8781899 0.9650722 0.8293478 0.9009531
## GSM820518 0.9115547 0.8713382 0.9247661 0.8601046 0.9209805 0.9051381
## GSM820519 0.8864231 0.9324985 0.8949583 0.9322218 0.8950744 0.9597308
## GSM820520 0.9349881 0.8806670 0.9514800 0.8731558 0.9301280 0.9180982
## GSM820521 0.9085925 0.9475820 0.9089643 0.9224893 0.9131154 0.9681213
## GSM820522 1.0000000 0.9262929 0.9511794 0.8917089 0.9237271 0.9246435
## GSM820523 0.9262929 1.0000000 0.9111565 0.9350314 0.8817770 0.9603706
## GSM820524 0.9511794 0.9111565 1.0000000 0.9037952 0.9139681 0.9112436
## GSM820525 0.8917089 0.9350314 0.9037952 1.0000000 0.8415122 0.9204252
## GSM820526 0.9237271 0.8817770 0.9139681 0.8415122 1.0000000 0.9282016
## GSM820527 0.9246435 0.9603706 0.9112436 0.9204252 0.9282016 1.0000000
## GSM820528 0.9461775 0.9161095 0.9455812 0.8964670 0.9372344 0.9449786
## GSM820529 0.9101066 0.9536168 0.8988913 0.9232237 0.9051941 0.9764954
## GSM820530 0.9374851 0.8940675 0.9232363 0.8584350 0.9496136 0.9318156
## GSM820531 0.9101826 0.9446634 0.9003839 0.9293514 0.9049499 0.9668194
## GSM820532 0.8961047 0.8999407 0.9097986 0.9287621 0.8751776 0.9139221
## GSM820533 0.9094910 0.9427849 0.8921813 0.8875923 0.9203758 0.9726729
##           GSM820528 GSM820529 GSM820530 GSM820531 GSM820532 GSM820533
## GSM820516 0.9511778 0.9241083 0.9573796 0.9223823 0.8897132 0.9255052
## GSM820517 0.8829546 0.9061505 0.8482370 0.9331725 0.9395367 0.8809038
## GSM820518 0.9388145 0.9047172 0.9414236 0.9125297 0.9123688 0.9116292
## GSM820519 0.9117054 0.9413361 0.8987844 0.9552921 0.9254827 0.9374967
## GSM820520 0.9623691 0.9130390 0.9504108 0.9156400 0.9092395 0.9118986
## GSM820521 0.9291090 0.9584997 0.9291209 0.9720189 0.9300936 0.9686817
## GSM820522 0.9461775 0.9101066 0.9374851 0.9101826 0.8961047 0.9094910
## GSM820523 0.9161095 0.9536168 0.8940675 0.9446634 0.8999407 0.9427849
## GSM820524 0.9455812 0.8988913 0.9232363 0.9003839 0.9097986 0.8921813
## GSM820525 0.8964670 0.9232237 0.8584350 0.9293514 0.9287621 0.8875923
## GSM820526 0.9372344 0.9051941 0.9496136 0.9049499 0.8751776 0.9203758
## GSM820527 0.9449786 0.9764954 0.9318156 0.9668194 0.9139221 0.9726729
## GSM820528 1.0000000 0.9497681 0.9561975 0.9383683 0.9260060 0.9335861
## GSM820529 0.9497681 1.0000000 0.9288649 0.9672203 0.9204350 0.9688689
## GSM820530 0.9561975 0.9288649 1.0000000 0.9346090 0.9035232 0.9388144
## GSM820531 0.9383683 0.9672203 0.9346090 1.0000000 0.9422943 0.9642204
## GSM820532 0.9260060 0.9204350 0.9035232 0.9422943 1.0000000 0.9057118
## GSM820533 0.9335861 0.9688689 0.9388144 0.9642204 0.9057118 1.0000000
cor.dist <- as.dist(1 - corMat)

Hierachical clustering

Hierachical clustering is the method by which samples with similar profiles are grouped to- gether, based on their distances. The method for grouping samples can be specified, with the default being to use complete linkage. Different linkage methods can affect the properties of the clustering output. e.g. the ’ward’ method tends to produce smaller, compact clusters. A popular way of displaying the results of clustering is a dendrogram, which arranges the sam- ples on the x-axis and shows the distances between samples on the y-axis. It is important to note that the distance of samples on the x-axis has no meaning. The fact that two samples are displayed next to each other, does not m

par(mfrow = c (1 , 2))
clust.euclid = hclust(euc.dist)
clust.cor = hclust (cor.dist)
par (mfrow = c (1 , 2))
plot(clust.euclid , label = SampleGroup )
plot(clust.cor , label = SampleGroup )

Extracting data from the clustering

If we want to interpret the data presented in a clustering analysis, we need a way of extract- ing which samples are grouped together, or to determine the optimal grouping of samples. One intuitive way of assigning groups it to ’cut’ the dendrogram at a particular height on the y-axis. We can do this manually on the plot, or use the cutree function to return the labels of samples that are belong to the same group when the dendrogram is cut at the specified height, h. Alternatively, we can specify how many groups, k, that we want to create.

library (cluster)
par (mfrow = c(1 , 1))
plot(clust.cor)
abline (h = 0.12, col = " red ")

cutree (clust.cor , h = 0.12)
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 
##         1         2         1         2         1         2         1 
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529 
##         2         1         2         1         2         1         2 
## GSM820530 GSM820531 GSM820532 GSM820533 
##         1         2         2         2
cutree (clust.cor , k = 2)
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 
##         1         2         1         2         1         2         1 
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529 
##         2         1         2         1         2         1         2 
## GSM820530 GSM820531 GSM820532 GSM820533 
##         1         2         2         2
table (cutree(clust.cor , k = 3) , SampleGroup)
##    SampleGroup
##     normal tumor
##   1      0     8
##   2      2     1
##   3      7     0

A Silhouette plot can be used to choose the optimal number of clusters. For each sample, we calculate a value that quantifies how well it ’fits’ the cluster that it has been assigned to. If the value is around 1, then the sample closely fits other samples in the same cluster. How- ever, if the value is around 0 the sample could belong to another cluster. In the silhouette plot, the values for each cluster are plotted together and ordered from largest to smallest. The number of samples belonging to each group is also displayed.

par (mfrow = c (2 , 2))
plot (silhouette(cutree(clust.cor, k=2),cor.dist),
      col="red",main=paste("k=",2))
plot (silhouette(cutree(clust.cor, k=3),cor.dist),
      col="red",main=paste("k=",3))
plot (silhouette(cutree(clust.cor, k=4),cor.dist),
      col="red",main=paste("k=",4))
plot (silhouette(cutree(clust.cor, k=5),cor.dist),
      col="red",main=paste("k=",5))

If we have prior knowledge of how many clusters to expect, we could run the clustering in a supervised manner. The Partition Around Medioids method can be used to group samples into k clusters.

pam.clus <- pam (euc.dist , k = 2)
clusplot (pam.clus)

pam.clus$clustering
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 
##         1         2         1         2         1         2         1 
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529 
##         2         1         2         1         2         1         2 
## GSM820530 GSM820531 GSM820532 GSM820533 
##         1         2         2         2
table(pam.clus$clustering , SampleGroup)
##    SampleGroup
##     normal tumor
##   1      0     8
##   2      9     1

Producing a heatmap

A heatmap is often used to visualise differences between samples. Each row represents a gene and each column is an array and coloured cells indicate the expression levels of genes. Both samples and genes with similar expression profile are clustered together. By default, euclidean distances are used with complete linkage clustering. Drawing a heatmap in R uses a lot of memory and can take a long time, therefore reducing the amount of data to be plotted is usually recommended. Including too many non- informative genes can also make it difficult to spot patterns. Typically, data are filtered to include the genes which tell us the most about the biological variation. In an un-supervised setting, the selection of such genes is done without using prior knowledge about the sample groupings.

IQRs = apply (exprs(varFiltered) , 1 , IQR )
highVarGenes = order (IQRs, decreasing = T )[1:100]
Symbols <- as.character(fData(colonData)$Symbol[highVarGenes])
heatmap (as.matrix(exprs(varFiltered)[highVarGenes, ]),
         labCol = SampleGroup , labRow = Symbols)

The default options for the heatmap are to cluster both the genes (rows) and samples (columns). However, sometimes we might want to specify a particular order. For example, we might want to order the columns according to sample groups. We can do this by re-ordering the input matrix manually and setting the Colv argument to NA. This tells the heatmap function not be cluster the columns. It choosing this option, we need to be careful that any column colourings or labels are set to the same ordering. Alternatively, a pre-calculated dendrogram could be used.

clus.ward <- hclust (cor.dist , method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
heatmap (as.matrix(exprs(varFiltered)[highVarGenes, ]) ,
         Colv = as.dendrogram(clus.ward) , labCol = SampleGroup )

Customising the heatmap

The heatmap function can be customised in many ways to make the output more informa- tive. For example, the labRow and labCol parameters can be used to give labels to the rows (genes) and columns (sample) of the heatmap. Similarly, ColSideColors and RowSideColors give coloured labels, often used to indicate different groups which are know in advance. See the help page for heatmap for more details.

labs <- as.factor(SampleGroup)
levels(labs) <- c ("yellow" , "blue")
heatmap(as.matrix(exprs(varFiltered)[highVarGenes, ]) ,
        labCol = Patient, ColSideColors = as.character(labs),
        labRow = Symbols)

The colours used to display the gene expression values can also be modified. For this, we can use the RColorBrewer package which has functions for creating pre-defined palettes. The function display.brewer.all can be used to display the palettes available through this package.

library (RColorBrewer)
display.brewer.all()

hmcol <- brewer.pal(11 , "RdBu")
heatmap (as.matrix(exprs(varFiltered)[highVarGenes, ]) ,
  ColSideColors = as.character(labs) , labRow = Symbols,
  col=hmcol)

You should avoid using the traditional red / green colour scheme as it may be difficult for people with colour-blindness to interpret

A popular use for heatmaps is to take an existing gene list (e.g. genes found to be significant in a previous study, or genes belonging to a particular pathway) and produce an image of how they cluster the data for exploratory purposes. This can be achieved by selecting the correct rows in the data matrix.

Here we use the pre-built annotation package for the particular array platform (illuminaHumanv3.db). These packages contain mappings between the manufacturer identifiers, and more-common genomic identifiers. The illuminaHumanv3PATH mapping will take a set of illumina IDs and return the KEGG pathway for each. However, if we want the Illumina IDs for a particular KEGG pathway, we have to reverse the mapping.

library (illuminaHumanv3.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
## Loading required package: org.Hs.eg.db
## Loading required package: DBI
pathwayGenes <- unlist (mget("04110" , revmap(illuminaHumanv3PATH)))
pathwayGenes <- pathwayGenes [pathwayGenes %in% featureNames(varFiltered)]
symbols <- fData(varFiltered)[pathwayGenes , "Symbol"]
heatmap (as.matrix(exprs(varFiltered)[pathwayGenes , ]),
  ColSideColors = as.character (labs) , labCol = Patient ,
  labRow = symbols , col = hmcol )

Later on, we will show how other identifier mapping can be performed using so-called organism packages.

Principal Components Analysis

Principal components analysis (PCA) is a data reduction technique that allows us to simplify multidimensional data sets to 2 or 3 dimensions for plotting purposes and identify which factors explain the most variability in the data. We can use the prcomp function to perform a PCA and we have to supply it with a distance matrix. The resulting object contains information on the proportion of variance that is ’explained’ by each component. Ideally, we want to see that the majority of the variance is explained by the first 2 or 3 components, and that these components are associated with a biological factor

pca <- prcomp(exprs(varFiltered))
plot(pca)

summary(pca)
## Importance of components:
##                          PC1     PC2     PC3    PC4     PC5     PC6
## Standard deviation     7.171 1.13645 0.84320 0.6838 0.53334 0.50432
## Proportion of Variance 0.924 0.02321 0.01278 0.0084 0.00511 0.00457
## Cumulative Proportion  0.924 0.94724 0.96002 0.9684 0.97353 0.97810
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.44891 0.40378 0.38232 0.36830 0.32469 0.30894
## Proportion of Variance 0.00362 0.00293 0.00263 0.00244 0.00189 0.00172
## Cumulative Proportion  0.98172 0.98465 0.98728 0.98972 0.99161 0.99333
##                           PC13    PC14    PC15    PC16    PC17   PC18
## Standard deviation     0.27589 0.27127 0.25520 0.24435 0.22922 0.2104
## Proportion of Variance 0.00137 0.00132 0.00117 0.00107 0.00094 0.0008
## Cumulative Proportion  0.99469 0.99602 0.99719 0.99826 0.99920 1.0000

The ggplot2 package is convenient for plotting principal components results as it allows many different covariates to be overlaid on the plot. See http://ggplot2.org/ for more informa- tion on this package.

library(ggplot2)
clusLabs <- cutree(clust.cor , k = 3)
pcRes <- data.frame(pca$rotation , SampleGroup , Sample = Patient)
ggplot (pcRes , aes(x = PC1 , y = PC2 , col = SampleGroup ,
                       label = Patient , pch = as.factor(clusLabs))) + geom_point () +
      geom_text(vjust=0,alpha=0.5)

Classification

The Bioconductor project has a collection of example datasets. Often these are used as examples to illustrate a particular package or functionality, or to accompany the analysis presented in a publication. For example, several datasets are presented to accompany the genefu package which has functions useful for the classification of breast cancer patients based on expression profiles. An experimental dataset can be installed and loaded as with any other Bioconductor package. The data itself is saved as an object in the package. You will need to see the documentation for the package to find out the relevant object name. The full list of datasets available through Bioconductor can be found here

library(breastCancerVDX)
library(breastCancerTRANSBIG)
data(vdx)
data(transbig)
dim(vdx)
## Features  Samples 
##    22283      344
dim(transbig)
## Features  Samples 
##    22283      198
annotation(vdx)
## [1] "hgu133a"
annotation(transbig)
## [1] "hgu133a"

If we want any classifers to be reproducible and applicable to other datasets, it is sensible to exclude probes that do not have sufficient annotation from the analysis. For this, we can use the genefilter package as before. The nsFilter function performs this annotation-based filtering as well as variance filtering. The output of the function includes details about how many probes were removed at each stage of the filtering.

library (genefilter)
vdx.filt <- nsFilter(vdx)
## 
vdx.filt
## $eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 6221 features, 344 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
##   varLabels: samplename dataset ... e.os (21 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 206797_at 203440_at ... 201130_s_at (6221 total)
##   fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 17420468 
## Annotation: hgu133a 
## 
## $filter.log
## $filter.log$numDupsRemoved
## [1] 7408
## 
## $filter.log$numLowVar
## [1] 6221
## 
## $filter.log$numRemoved.ENTREZID
## [1] 2423
## 
## $filter.log$feature.exclude
## [1] 10
vdx.filt <- vdx.filt[[1]]

Format the vdx data for pamr, and train a classifier to predict ER status. For extra clarity in the results, it might be useful to rename the binary er status used in the data package to something more descriptive.

library(pamr)
## Loading required package: survival
dat <- exprs(vdx.filt)
gN <- as.character(fData(vdx.filt)$Gene.symbol)
gI <- featureNames (vdx.filt)
sI <- sampleNames (vdx.filt)
erStatus <- pData (vdx)$er
erStatus <- gsub (0 , "ER -" , erStatus )
erStatus <- gsub (1 , "ER +" , erStatus )

Fitting the model

train.dat <- list ( x = dat , y = erStatus , genenames = gN ,
              geneid = gI , sampleid = sI )
model <- pamr.train(train.dat ,n.threshold = 100)
## 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
model
## Call:
## pamr.train(data = train.dat, n.threshold = 100)
##     threshold nonzero errors
## 1    0.000    6221    41    
## 2    0.130    5803    41    
## 3    0.260    5365    39    
## 4    0.390    4926    39    
## 5    0.519    4501    39    
## 6    0.649    4157    38    
## 7    0.779    3789    38    
## 8    0.909    3440    38    
## 9    1.039    3144    38    
## 10   1.169    2884    36    
## 11   1.299    2624    34    
## 12   1.428    2377    34    
## 13   1.558    2197    34    
## 14   1.688    2000    33    
## 15   1.818    1823    34    
## 16   1.948    1663    33    
## 17   2.078    1538    33    
## 18   2.208    1405    34    
## 19   2.337    1281    33    
## 20   2.467    1164    33    
## 21   2.597    1067    32    
## 22   2.727     994    32    
## 23   2.857     910    33    
## 24   2.987     823    33    
## 25   3.117     753    33    
## 26   3.246     697    32    
## 27   3.376     643    31    
## 28   3.506     577    32    
## 29   3.636     522    33    
## 30   3.766     460    33    
## 31   3.896     398    33    
## 32   4.026     360    33    
## 33   4.155     319    33    
## 34   4.285     284    33    
## 35   4.415     246    33    
## 36   4.545     219    33    
## 37   4.675     191    33    
## 38   4.805     170    34    
## 39   4.935     155    34    
## 40   5.064     132    34    
## 41   5.194     117    35    
## 42   5.324     103    35    
## 43   5.454      89    36    
## 44   5.584      77    36    
## 45   5.714      68    36    
## 46   5.844      63    36    
## 47   5.973      57    36    
## 48   6.103      54    37    
## 49   6.233      49    37    
## 50   6.363      44    37    
## 51   6.493      41    37    
## 52   6.623      37    37    
## 53   6.753      34    36    
## 54   6.882      32    37    
## 55   7.012      28    37    
## 56   7.142      27    37    
## 57   7.272      25    38    
## 58   7.402      23    39    
## 59   7.532      19    41    
## 60   7.661      17    42    
## 61   7.791      17    42    
## 62   7.921      17    43    
## 63   8.051      17    43    
## 64   8.181      16    43    
## 65   8.311      15    44    
## 66   8.441      13    43    
## 67   8.570      11    48    
## 68   8.700       8    48    
## 69   8.830       7    48    
## 70   8.960       6    51    
## 71   9.090       6    54    
## 72   9.220       6    56    
## 73   9.350       5    57    
## 74   9.479       4    58    
## 75   9.609       4    59    
## 76   9.739       4    63    
## 77   9.869       3    66    
## 78   9.999       3    69    
## 79  10.129       3    73    
## 80  10.259       3    79    
## 81  10.388       3    100   
## 82  10.518       3    118   
## 83  10.648       2    134   
## 84  10.778       2    135   
## 85  10.908       2    135   
## 86  11.038       1    135   
## 87  11.168       1    135   
## 88  11.297       1    135   
## 89  11.427       1    135   
## 90  11.557       1    135   
## 91  11.687       1    135   
## 92  11.817       1    135   
## 93  11.947       1    135   
## 94  12.077       1    135   
## 95  12.206       1    135   
## 96  12.336       1    135   
## 97  12.466       1    135   
## 98  12.596       1    135   
## 99  12.726       1    135   
## 100 12.856       0    135

We can perform cross-validation using the pamr.cv function. Printing the output of this function shows a table of how many genes were used at each threshold, and the number of classification errors. Both these values need to be taken into account when choosing a suit- able theshold. The pamr.plotcv function can assist with this by producing a diagnostic plot which shows how the error changes with the number of genes. In the plot produced by this function there are two panels; the top one shows the errors in the whole dataset and the bottom one considers each class separately. In each panel, the x axis corresponds to the thresh- old (and number of genes at each threshold) whereas the y-axis is the number of misclassifications.

model.cv <- pamr.cv(model , train.dat , nfold = 10)
## 12Fold 1 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 2 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 3 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 4 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 5 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 6 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 7 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 8 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 9 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 10 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
model.cv
## Call:
## pamr.cv(fit = model, data = train.dat, nfold = 10)
##     threshold nonzero errors
## 1    0.000    6221    44    
## 2    0.130    5803    45    
## 3    0.260    5365    45    
## 4    0.390    4926    45    
## 5    0.519    4501    44    
## 6    0.649    4157    44    
## 7    0.779    3789    43    
## 8    0.909    3440    42    
## 9    1.039    3144    40    
## 10   1.169    2884    40    
## 11   1.299    2624    40    
## 12   1.428    2377    39    
## 13   1.558    2197    36    
## 14   1.688    2000    37    
## 15   1.818    1823    39    
## 16   1.948    1663    39    
## 17   2.078    1538    40    
## 18   2.208    1405    40    
## 19   2.337    1281    40    
## 20   2.467    1164    39    
## 21   2.597    1067    39    
## 22   2.727     994    37    
## 23   2.857     910    36    
## 24   2.987     823    35    
## 25   3.117     753    36    
## 26   3.246     697    36    
## 27   3.376     643    36    
## 28   3.506     577    36    
## 29   3.636     522    36    
## 30   3.766     460    35    
## 31   3.896     398    35    
## 32   4.026     360    35    
## 33   4.155     319    35    
## 34   4.285     284    34    
## 35   4.415     246    34    
## 36   4.545     219    34    
## 37   4.675     191    34    
## 38   4.805     170    34    
## 39   4.935     155    34    
## 40   5.064     132    34    
## 41   5.194     117    34    
## 42   5.324     103    36    
## 43   5.454      89    36    
## 44   5.584      77    36    
## 45   5.714      68    36    
## 46   5.844      63    37    
## 47   5.973      57    37    
## 48   6.103      54    38    
## 49   6.233      49    38    
## 50   6.363      44    38    
## 51   6.493      41    38    
## 52   6.623      37    38    
## 53   6.753      34    38    
## 54   6.882      32    39    
## 55   7.012      28    37    
## 56   7.142      27    37    
## 57   7.272      25    39    
## 58   7.402      23    39    
## 59   7.532      19    40    
## 60   7.661      17    41    
## 61   7.791      17    41    
## 62   7.921      17    42    
## 63   8.051      17    43    
## 64   8.181      16    44    
## 65   8.311      15    47    
## 66   8.441      13    47    
## 67   8.570      11    48    
## 68   8.700       8    50    
## 69   8.830       7    51    
## 70   8.960       6    53    
## 71   9.090       6    53    
## 72   9.220       6    56    
## 73   9.350       5    59    
## 74   9.479       4    59    
## 75   9.609       4    61    
## 76   9.739       4    64    
## 77   9.869       3    65    
## 78   9.999       3    73    
## 79  10.129       3    84    
## 80  10.259       3    100   
## 81  10.388       3    116   
## 82  10.518       3    123   
## 83  10.648       2    127   
## 84  10.778       2    133   
## 85  10.908       2    134   
## 86  11.038       1    135   
## 87  11.168       1    135   
## 88  11.297       1    134   
## 89  11.427       1    135   
## 90  11.557       1    135   
## 91  11.687       1    135   
## 92  11.817       1    135   
## 93  11.947       1    135   
## 94  12.077       1    135   
## 95  12.206       1    135   
## 96  12.336       1    135   
## 97  12.466       1    135   
## 98  12.596       1    135   
## 99  12.726       1    135   
## 100 12.856       0    135
pamr.plotcv(model.cv)

In the following sections, feel free to experiment with different values of the threshold (which we will call Delta) The misclassifications can easily be visualised as a ’confusion table’. This simply tabulates the classes assigned to each sample against the original label assigned to the sample. e.g. Misclassifications are samples that we thought were ’ER+’ but have been assigned to the ’ER-’ group by the classifier, or ’ER-’ samples assigned as ’ER+’ by the classifier.

Delta <- 8
pamr.confusion(model.cv , Delta)
##      ER - ER + Class Error rate
## ER -  106   29       0.21481481
## ER +   14  195       0.06698565
## Overall error rate= 0.125

A visual representation of the class separation can be obtained using the pamr.plotcvprob function. For each sample there are two circles representing the probabilty of that sample being classified ER- (red) or ER+ (green).

pamr.plotcvprob(model , train.dat , Delta )

There are a couple of ways of extract the details of the genes that have been used in the classifier. We can list their names using the pamr.listgenes function, which in our case these are just returns the microarray probe names. We can however, use these IDs to query the featureData stored with the original vdx object. We can also plot the expression values for each gene, coloured according to the class label.

pamr.listgenes(model , train.dat , Delta )
##       id          ER --score ER +-score
##  [1,] 205225_at   -0.3257    0.2104    
##  [2,] 209173_at   -0.202     0.1305    
##  [3,] 209603_at   -0.1753    0.1133    
##  [4,] 204508_s_at -0.1184    0.0765    
##  [5,] 205009_at   -0.0987    0.0638    
##  [6,] 214440_at   -0.083     0.0536    
##  [7,] 205186_at   -0.0583    0.0376    
##  [8,] 219197_s_at -0.0507    0.0327    
##  [9,] 209459_s_at -0.0453    0.0292    
## [10,] 212956_at   -0.0425    0.0274    
## [11,] 218976_at   -0.0402    0.026     
## [12,] 203929_s_at -0.035     0.0226    
## [13,] 204623_at   -0.0325    0.021     
## [14,] 215729_s_at 0.0295     -0.0191   
## [15,] 209800_at   0.0211     -0.0136   
## [16,] 218211_s_at -0.018     0.0116    
## [17,] 205862_at   -0.0109    0.0071
classifierGenes <- pamr.listgenes(model , train.dat , Delta )[,1]
##       id          ER --score ER +-score
##  [1,] 205225_at   -0.3257    0.2104    
##  [2,] 209173_at   -0.202     0.1305    
##  [3,] 209603_at   -0.1753    0.1133    
##  [4,] 204508_s_at -0.1184    0.0765    
##  [5,] 205009_at   -0.0987    0.0638    
##  [6,] 214440_at   -0.083     0.0536    
##  [7,] 205186_at   -0.0583    0.0376    
##  [8,] 219197_s_at -0.0507    0.0327    
##  [9,] 209459_s_at -0.0453    0.0292    
## [10,] 212956_at   -0.0425    0.0274    
## [11,] 218976_at   -0.0402    0.026     
## [12,] 203929_s_at -0.035     0.0226    
## [13,] 204623_at   -0.0325    0.021     
## [14,] 215729_s_at 0.0295     -0.0191   
## [15,] 209800_at   0.0211     -0.0136   
## [16,] 218211_s_at -0.018     0.0116    
## [17,] 205862_at   -0.0109    0.0071
pamr.geneplot(model , train.dat ,Delta)

You may get an error message Error in plot.new(): Figure margins too large when trying to produce the gene plot. If this occurs, try increasing the size of your plotting window, or decrease the number of genes by decreasing the threshold. Alternatively, the fol- lowing code will write the plots to a pdf.

pdf ("classifierProfiles.pdf")
for (i in 1: length (classifierGenes)) {
  Symbol <- fData(vdx.filt)[classifierGenes[i] , "Gene.symbol"]
  boxplot(exprs(vdx.filt)[classifierGenes[i], ] ~ erStatus ,
  main = Symbol )
}
dev.off()
## png 
##   2

Use the genes identified by the classifier to produce a heatmap to confirm that they separate the samples as expected.

symbols <- fData(vdx.filt)[classifierGenes , "Gene.symbol"]
heatmap(exprs(vdx.filt)[classifierGenes, ] , labRow = symbols )

Testing the model

We can now test the classifier on an external dataset. We choose the transbig dataset for simplicity as it was generated on the same microarray platform

library (breastCancerTRANSBIG)
data (transbig)
pData (transbig)[1:4, ]
##                samplename  dataset  series   id         filename size age er grade pgr her2 brca.mutation e.dmfs t.dmfs node t.rfs e.rfs treatment tissue t.os e.os
## VDXGUYU_4002 VDXGUYU_4002 TRANSBIG VDXGUYU 4002 GSM177885.CEL.gz  3.0  57  0     3  NA   NA            NA      1    723    0   723     1         0      1  937    1
## VDXGUYU_4008 VDXGUYU_4008 TRANSBIG VDXGUYU 4008 GSM177886.CEL.gz  3.0  57  1     3  NA   NA            NA      0   6591    0   183     1         0      1 6591    0
## VDXGUYU_4011 VDXGUYU_4011 TRANSBIG VDXGUYU 4011 GSM177887.CEL.gz  2.5  48  0     3  NA   NA            NA      1    524    0   524     1         0      1  922    1
## VDXGUYU_4014 VDXGUYU_4014 TRANSBIG VDXGUYU 4014 GSM177888.CEL.gz  1.8  42  1     3  NA   NA            NA      1   6255    0  2192     1         0      1 6255    1
transbig.filt <- transbig [featureNames(vdx.filt) , ]
predClass <- pamr.predict(model ,exprs(transbig.filt) ,Delta )
table (predClass, pData(transbig)$ er)
##          
## predClass   0   1
##      ER -  52  11
##      ER +  12 123
boxplot (pamr.predict(model , exprs(transbig.filt), Delta ,
                           type = "posterior" )[, 1] ~ pData(transbig)$er)

Make a heatmap of the transbig data using the genes involved in the vxd classifier

erLab <- as.factor(pData(transbig)$er)
levels (erLab) <- c ("blue" , "yellow")

heatmap (exprs(transbig.filt)[classifierGenes , ] , labRow = symbols ,
  ColSideColors = as.character (erLab))

Survival Analysis

An attractive feature of the vdx dataset is that it includes survival data for each breast can- cer patient. We are not explicitly covering survival analysis in this course, but for your reference, here are the commands to create survival curves when patients are grouped by ER status and tumour grade.

library (survival)
par (mfrow = c (1 , 2))
plot (survfit (Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
  pData(vdx)$er) , col = c("cyan" , "salmon"))

plot (survfit(Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
  pData (vdx)$grade) , col = c("blue" , "yellow" , "orange"))

survdiff(Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
  pData (vdx)$er)
## Call:
## survdiff(formula = Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~ 
##     pData(vdx)$er)
## 
##                   N Observed Expected (O-E)^2/E (O-E)^2/V
## pData(vdx)$er=0 135       38     45.5     1.244      2.04
## pData(vdx)$er=1 209       80     72.5     0.781      2.04
## 
##  Chisq= 2  on 1 degrees of freedom, p= 0.154
survdiff(Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
  pData(vdx)$grade)
## Call:
## survdiff(formula = Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~ 
##     pData(vdx)$grade)
## 
## n=197, 147 observations deleted due to missingness.
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## pData(vdx)$grade=1   7        0     3.06      3.06      3.21
## pData(vdx)$grade=2  42       10    16.69      2.68      3.53
## pData(vdx)$grade=3 148       61    51.26      1.85      6.72
## 
##  Chisq= 7.7  on 2 degrees of freedom, p= 0.0218